## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## quality
## 1 5
## 2 5
## 3 5
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
There are six levels of red wine quality in our dataset: 3, 4, 5, 6, 7 and 8. From the above bar chart, we can see two peaks at quality 5 and 6. These two levels account for 82% of observations in the dataset. So the quality classes are ordered but not balanced: there are many more normal wines than excellent or poor ones. Let’s classify the wines by setting cutoffs for our dependent variable, wine quality, at 5 and at 7. In other words, I regard quality 7 and above as ‘good wines’, quality 5 and 6 as ‘mediocre wines’ and quality scores below 5 as ‘inferior wines’.
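The class counts in the summary that follows can be reproduced from the frequency table above. This is a minimal sketch, assuming the classes were built with `cut()` using left-closed intervals (which matches the `[0,5)`, `[5,7)`, `[7,10)` labels in the output); the variable names are my own.

```r
# Rebuild the quality vector from the frequency table printed above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Left-closed intervals: [0,5) inferior, [5,7) mediocre, [7,10) good.
# right = FALSE is an assumption inferred from the interval labels.
quality_class <- cut(quality, breaks = c(0, 5, 7, 10), right = FALSE)

class_counts <- table(quality_class)
print(class_counts)

# Share of wines scored 5 or 6.
share_5_6 <- mean(quality %in% c(5, 6))
print(round(share_5_6, 2))
```

The three counts come out as 63, 1319 and 217, and the share of scores 5 and 6 rounds to 0.82.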
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.class
## Min. : 8.40 Min. :3.000 [0,5) : 63
## 1st Qu.: 9.50 1st Qu.:5.000 [5,7) :1319
## Median :10.20 Median :6.000 [7,10): 217
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Based on the data structure, there are 11 input attributes that may influence the quality: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol. Before further analysis, I want to see how each of these attributes is distributed in the dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Looking at the above plots, volatile acidity and fixed acidity appear to have quite similar distributions. What is the relationship between these two kinds of acidity? Are they highly correlated? I will dive into these two variables in the bivariate analysis section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Three quarters of the wines have a citric acid content of 0.42 or less, and 132 wines have zero citric acid. Based on some external research, citric acid is found only in very minute quantities in wine grapes, but winemakers can use it as an inexpensive supplement during acidification to boost the wine’s total acidity. In the European Union, the use of citric acid is limited, which can explain the low citric acid content.
My observations:
- The pH of the dataset seems to be normally distributed. The maximum pH is 4.01, meaning all wines in the dataset are acidic. Most wines have a pH between 3.2 and 3.4.
- 99% of the red wines have a chlorides content below 0.36.
- The sulphates distribution is right skewed, with only a few wines above 1.26 g/dm^3.
- Density is approximately normally distributed. The minimum density is 0.9901 and the maximum is 1.0037. Most wines’ density falls between 0.9956 and 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
All red wines in the dataset have less than 15% alcohol content. Most of them fall between 9.5% and 11%.
I applied a log transformation to residual sugar to tame its long right tail. In terms of residual sugar content, most wines fall between 1.9 and 2.6. Let’s zoom in.
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
The lowest residual sugar content is 0.9 and the highest is 15.5. The table above shows the full set of observed values. We can see that some residual sugar values occur more often than others. Many of these common values have a single decimal place, which may be a result of the measuring method. In further analysis, I will look into the correlation between residual sugar and other attributes, and the influence of residual sugar on the quality of wines.
In terms of sulfur dioxide content, the distributions are right skewed. There are some outliers with values above the 90th percentile of the dataset. Let’s remove these outliers.
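A minimal sketch of this kind of percentile filter follows. It uses the first ten `total.sulfur.dioxide` values from the `str()` output as a stand-in for the full column; the variable names are my own, and `quantile()` is used with its default interpolation, which may differ from what was actually run.

```r
# First ten total.sulfur.dioxide values from the str() output (a stand-in sample).
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# 90th percentile, using quantile()'s default interpolation method.
cutoff <- quantile(total_so2, probs = 0.90)

# Keep only observations at or below the cutoff.
trimmed <- total_so2[total_so2 <= cutoff]

print(cutoff)
print(trimmed)
```

On this small sample only the largest value (102) is dropped. On the full data, the same filter removes roughly the top 10% of wines, so it trims a sizeable part of the tail rather than a handful of extreme points.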
As sulfur dioxide forms sulfurous acid when it dissolves, I wonder how it correlates with acidity and pH. Let’s plot a correlation matrix later to get some idea of the factors that drive higher quality.
The dataset contains 1,599 red wines with 11 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol). The quality output takes integer values from 3 (worst) to 8 (best) in the dataset. All features are numeric. The major findings are:
- Fixed acidity, volatile acidity, citric acid, residual sugar, chlorides and sulphates are measured in g/dm^3; free and total sulfur dioxide are measured in mg/dm^3.
- The output variable quality is based on sensory data. There are six levels of red wine quality in our dataset: 3, 4, 5, 6, 7 and 8. Quality 5 and 6 account for 82% of observations. The quality classes are ordered but not balanced: there are many more normal wines than excellent or poor ones.
- Some of the variables, e.g. free sulfur dioxide and density, have outliers. Most outliers are on the high side.
- Residual sugar has a positively skewed distribution; even after eliminating outliers, the distribution remains skewed. The lowest residual sugar content is 0.9 and the highest is 15.5. Most values lie between 1.9 and 2.6. Some residual sugar values occur more often than others.
- The pH of the dataset seems to be normally distributed. The maximum pH is 4.01, meaning all wines in the dataset are acidic. Most wines have a pH value between 3.2 and 3.4.
- The distributions of sulfur dioxide content, both free and total, are widely spread and right skewed, with medians of 14.0 (free) and 38.0 (total).
The main features in the data set are alcohol, quality and quality.class. I’d like to determine which features are best for predicting the quality of a red wine. I suspect alcohol and some combination of the other variables can be used to build a predictive model to classify red wines by sensory quality.
Alcohol, sulfur dioxide (free and total), residual sugar and acidity (fixed and volatile) are likely to contribute to the quality of a wine. Based on my external research on the quality of red wines, great wines balance their four fundamental traits (acidity, tannin, alcohol and sweetness). Applying this to our dataset, I think alcohol, residual sugar and volatile acidity probably contribute most to the quality of red wines.
I derived a new variable, quality.class, from quality by setting cutoffs at 5 and 7. The counts for the three quality classes are: [0,5): 63; [5,7): 1319; [7,10): 217.
Of the features I investigated above, the sulfur dioxide distributions, both free and total, are highly right skewed with some extremely high outliers. I filtered out data points above the 90th percentile to obtain a more balanced plot. I also applied a log transformation to residual sugar to obtain a distribution that is closer to normal.
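To illustrate what the log transformation does to the residual sugar tail, here is a small sketch based only on the five-number summary reported earlier. The choice of `log10` and the `tail_ratio` helper are assumptions of mine, not necessarily what was used for the plot.

```r
# Five-number summary of residual sugar taken from the summary() output above.
sugar_summary <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# log10 is an assumption; a natural log would change the scale but not the idea.
log_summary <- log10(sugar_summary)

# Hypothetical helper: how much longer the upper tail is than the lower tail.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(sugar_summary)
log_ratio <- tail_ratio(log_summary)

print(round(c(raw = raw_ratio, log = log_ratio), 2))
```

The upper tail is about ten times longer than the lower one on the raw scale and about twice as long on the log scale, so the transformation reduces the skew without removing it.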
Let’s begin exploring the relationship between different features!
The correlation matrix and correlation coefficients indicate the following patterns among the attributes:
Based on the density plot above, we can see a peak in quality at 5. Higher-quality red wines tend to have a higher alcohol content. Outliers are also observed, especially at quality 5. After filtering out data points outside the 10% and 90% quantiles and grouping quality into three classes, we get the following boxplot.
Basically, high-quality red wines in the [7,10) class tend to have the highest alcohol content (median 11.6), while the [0,5) and [5,7) classes are almost indistinguishable (both have a median of 10.0). For quality scores from 5 to 8, the alcohol content rises with the quality score.
## df$quality.class: [0,5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.22 11.00 13.10
## --------------------------------------------------------
## df$quality.class: [5,7)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.00 10.25 10.90 14.90
## --------------------------------------------------------
## df$quality.class: [7,10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.60 11.52 12.20 14.00
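Per-class summaries like the one above can be produced with `by()` or `tapply()`. The sketch below is an assumption about how this was done, run on the first ten rows shown in the `str()` output rather than the full data, so its numbers will not match the table above.

```r
# First ten alcohol and quality values from the str() output (a stand-in sample).
sample_df <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same class boundaries as in the summary output; drop classes with no rows.
sample_df$quality.class <- droplevels(
  cut(sample_df$quality, breaks = c(0, 5, 7, 10), right = FALSE)
)

# by() prints one summary() block per class.
print(by(sample_df$alcohol, sample_df$quality.class, summary))

# tapply() returns a single statistic per class as a named vector.
class_medians <- tapply(sample_df$alcohol, sample_df$quality.class, median)
print(class_medians)
```

Ten rows are far too few to show the trend seen in the full data; the point here is only the mechanics of the grouped summary.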
Correlation coefficient: -0.39
## df$quality.class: [0,5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## df$quality.class: [5,7)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## df$quality.class: [7,10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
The correlation coefficient between volatile acidity and quality is -0.39, and the boxplot confirms that red wines in the high-quality class tend to have lower volatile acidity. Volatile acidity can therefore serve as one factor for predicting the quality of red wines.
In the above boxplot, outliers are observed, especially in the mediocre quality class with scores 5 and 6. We should be cautious when building prediction models. In terms of central tendency across the different quality classes, poor-quality wines tend to have a lower pH and a higher density. There is no significant difference in residual sugar content across quality classes, so residual sugar is not a good feature for predicting the quality of red wines.
Citric acid is part of fixed acidity according to my external research. The third plot shows a significant linear relationship between these two features. Fixed acidity is highly negatively correlated with pH, which is understandable given the definition of pH. Citric acid is negatively correlated with pH as well.
A negative correlation is observed between volatile acidity and citric acid. This may be caused by the winemaking process, in which citric-sugar co-metabolism can increase the formation of volatile acid in wine.
Most data points are clustered within the 0.995 to 1.000 g/cm^3 range for density. Based on the regression lines, density is positively correlated with fixed acidity and negatively correlated with alcohol.
Free sulfur dioxide and total sulfur dioxide content are spread over a fairly wide range. There is a significant positive correlation between these two features.
High-quality red wines with scores in [7,10) tend to have the highest alcohol content, while the [0,5) and [5,7) classes have similar, lower alcohol content. For quality scores from 5 to 8, the alcohol content rises with the quality score.
With a correlation coefficient of -0.39 between volatile acidity and quality, red wines in higher quality classes tend to have lower volatile acidity.
Positive Correlation:
- Free sulfur dioxide vs Total sulfur dioxide
- Density vs Fixed Acidity
- Citric Acid vs Fixed Acidity
Negative Correlation:
- Volatile Acidity vs Citric Acid
- Fixed Acidity vs pH
- Citric Acid vs pH
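The signs listed above can be checked with a Pearson correlation matrix. This sketch uses only the first ten rows visible in the `str()` output, so the magnitudes are illustrative and not the coefficients of the full dataset; the data frame name is my own.

```r
# First ten rows of three acidity-related columns, copied from the str() output.
sample_df <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(sample_df, method = "pearson")
print(round(cor_matrix, 2))
```

Even in this small sample, citric acid moves with fixed acidity, and both move against pH.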
Density and fixed acidity show a significant positive correlation across all quality classes.
## # A tibble: 6 x 3
## quality density_median alcohol.median
## <fct> <dbl> <dbl>
## 1 3 0.998 9.93
## 2 4 0.996 10.0
## 3 5 0.997 9.70
## 4 6 0.997 10.5
## 5 7 0.996 11.5
## 6 8 0.995 12.2
For red wines with quality scores higher than 3, the median density generally decreases as the quality score increases. The median alcohol content generally increases with the quality score; the only exception is the dip at a score of 5. Across the different wine qualities, density and alcohol show a significant negative correlation. High-quality red wines cluster in the bottom-right of the plot, which means good wines tend to have a high alcohol content and a low density.
There is no significant correlation between density and volatile acidity. Most data points cluster within the range of 0.995 to 1.000 for density and 0.3 to 0.7 for volatile acidity. There seem to be no distinctive patterns for wines of different quality.
High quality wines tend to have higher alcohol content along with lower volatile acidity.
Based on the scatter plots with linear regression lines above, the variance around the regression line is always higher in the quality range [0,5) than in the other quality ranges, which might be caused by the small number of data points in this class. We should handle data in this class with extra caution when building prediction models.
Quality vs Alcohol & Density:
- For red wines with quality scores higher than 3, the median density generally decreases as the quality score increases.
- The median alcohol content generally increases with the quality score; the only exception is the dip at a score of 5.
- Across the different wine qualities, density and alcohol show a significant negative correlation. High-quality red wines cluster in the bottom-right of the plot, which means good wines tend to have a high alcohol content along with a low density.
Quality vs Alcohol & Volatile Acidity:
- High-quality wines tend to have a higher alcohol content along with lower volatile acidity.
- There is no significant relationship between alcohol and volatile acidity.
pH, Acidity, Citric Acid:
- The variance around the regression line is always higher in the quality range [0,5) than in the other quality ranges, which might be caused by the small number of data points in this class. We should handle data in this class with extra caution when building prediction models.
When examining the distributions of volatile acidity and citric acid among quality classes, I was surprised to find that citric acid, which is negatively correlated with volatile acidity, contributes to higher quality scores. Interactions between volatile acidity and citric acid during winemaking can influence the quality results; for example, citric-sugar co-metabolism can increase the formation of volatile acid in wine. As citric acid is consumed in this co-metabolism and volatile acid is produced, a lower citric acid content goes along with more vinegary characters, which reduces the sensory-based quality scores of red wines.
We observe correlations of varying strength among the variables in this dataset. For instance, there is a fairly strong positive correlation between total sulfur dioxide and free sulfur dioxide. All these correlations make intuitive sense. Correlated variables should be handled carefully in predictive modeling.
## df$quality.class: [0,5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5650 0.6800 0.7242 0.8825 1.5800
## --------------------------------------------------------
## df$quality.class: [5,7)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## df$quality.class: [7,10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4055 0.4900 0.9150
The above correlation matrix shows a significant negative correlation between quality and volatile acidity. This boxplot indicates that red wines in the high-quality class tend to have lower volatile acidity.
## # A tibble: 6 x 3
## quality density_median alcohol.median
## <dbl> <dbl> <dbl>
## 1 1.00 0.998 9.93
## 2 2.00 0.996 10.0
## 3 3.00 0.997 9.70
## 4 4.00 0.997 10.5
## 5 5.00 0.996 11.5
## 6 6.00 0.995 12.2
In the table above, the quality column shows level codes 1 to 6, which correspond to the quality scores 3 to 8. For red wines with quality scores higher than 3, the median density generally decreases as the quality score increases. The median alcohol content generally increases with the quality score; the only exception is the dip at a score of 5. Across the different wine qualities, density and alcohol show a significant negative correlation. High-quality red wines cluster in the bottom-right of the plot, which means good wines tend to have a high alcohol content and a low density.
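The quality column in the second summary table runs from 1 to 6, although the medians match the earlier table for scores 3 to 8 row for row. One plausible cause, which is an assumption since the original code is not shown, is converting a factor to numeric directly. The sketch below shows the pitfall and the usual fix.

```r
# Quality stored as a factor, as in the first summary table.
quality <- factor(c(3, 4, 5, 6, 7, 8))

# Direct conversion returns the internal level codes, not the scores.
level_codes <- as.numeric(quality)

# Converting via character recovers the original scores.
scores <- as.numeric(as.character(quality))

print(level_codes)
print(scores)
```

Going through `as.character()` first keeps the axis and table labels on the original 3 to 8 scale.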
This tidy data set, from 2009, contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). To understand which chemical properties influence the quality of red wines, I started by understanding the individual variables in the data set, and then explored interesting correlations and leads as I continued to make observations on plots. Eventually, I explored the relationships between attributes across different quality classes to uncover the major drivers of quality differences and some probable interactions between these chemical properties that may influence sensory-based quality.
There is a significant correlation between the alcohol content of a red wine and its quality. Red wines in the good quality class tend to have the highest alcohol content. Unexpectedly, pH does not have a strong correlation with a red wine’s quality, but two other acidity-related attributes, volatile acidity and citric acid, do. I struggled to understand why the median volatile acidity decreases gradually as the quality score increases while the citric acid content increases. This became clearer when I noticed the negative correlation between volatile acidity and citric acid. Interactions between the two during winemaking can influence the quality results; for example, citric-sugar co-metabolism can increase the formation of volatile acid in wine. As citric acid is consumed in this co-metabolism and volatile acid is produced, a lower citric acid content goes along with more vinegary characters, which reduces the sensory-based quality scores of red wines. In addition, good-quality red wines cluster in the density range between 0.995 and 0.996, which is lower than that of the poor-quality group. At this stage, I regard alcohol, volatile acidity, citric acid and density as the main features that influence the quality of red wines.
Looking back at the above exploratory data analysis, one limitation is the source of the data. The current data consists of samples collected from a specific region and is out of date. Since quality standards and quality categories vary between regions, conclusions drawn from one region could be biased and would probably lead to inaccuracies when applied to other regions.
Further, as some significant correlations between variables were observed, we should pre-process these variables with extra caution before regression and modeling. For example, we may need to eliminate or merge some variables to keep the features independent.
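One simple way to find candidates for elimination or merging is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below uses hypothetical toy features and an assumed cutoff of 0.8; neither comes from the wine data.

```r
# Hypothetical toy features: feat_b is an exact linear function of feat_a,
# while feat_c alternates in sign and is only weakly related to either.
features <- data.frame(
  feat_a = 1:10,
  feat_b = 2 * (1:10) + 1,
  feat_c = rep(c(1, -1), times = 5)
)

cor_matrix <- cor(features)
threshold <- 0.8  # assumed cutoff, to be tuned for the real data

# Look only at the upper triangle so each pair is reported once.
idx <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
print(high_pairs)
```

Only the `feat_a`/`feat_b` pair is reported here. On the wine data, a pair such as free and total sulfur dioxide would be a natural one to inspect first.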